Assume we have the correct model (this is a big assumption!)
How should we talk about things like uncertainty and error?
How do we make an argument for one model over another?
P-values
Confidence Intervals
S-Values and model comparisons
Bayesian alternatives
Can this British lady do wild stuff with her tea?
Fisher’s proposed experiment:
8 cups of tea. 4 milk first, and 4 tea first
Each cup presented in random order, and she’s asked to classify each one
The “null hypothesis”: she’s just guessing
The p-value is “the probability of correctly classifying at least N cups by guessing”
At least 1 correct \(= 69/70 = 99\%\)
At least 2 correct \(= 53/70 = 76\%\)
At least 3 correct \(= 17/70 = 24\%\)
4 correct \(= 1/70 = 1.43\%\)
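These probabilities come straight from counting: there are \(\binom{8}{4} = 70\) equally likely ways to guess which four cups were milk-first. A quick check in R:

```r
# Count the ways to get exactly k of the 4 milk-first cups right by guessing,
# out of choose(8, 4) = 70 equally likely guesses
k <- 0:4
ways <- choose(4, k) * choose(4, 4 - k)   # 1, 16, 36, 16, 1
p_at_least <- rev(cumsum(rev(ways))) / sum(ways)
round(p_at_least, 3)                      # Pr(at least 0, 1, 2, 3, 4 correct)
```

This reproduces the 69/70, 53/70, 17/70, and 1/70 figures above.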
In the broadest sense, p-values indicate the probability of getting a result greater than, less than, or at least as extreme in absolute value as the one in your sample if the null hypothesis is true.
This is a statement about the probability of the data given a model, not a statement about the probability of a model.
Pr(>|t|) = .225: There is a 22.5% chance of seeing a result as extreme (in absolute value) as our sample statistic if the null hypothesis is true
Pr(>t) = .04: There is a 4% chance of seeing a result greater than the sample statistic if the null hypothesis is true
Pr(<t) = .95: There is a 95% chance of seeing a result less than the sample statistic if the null hypothesis is true
Testing at the .05 level means that, when the null hypothesis is true, around 1 in 20 results would be false positives
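The "1 in 20" claim is easy to verify by simulation, a minimal sketch with simulated null data (sample size and number of replications are arbitrary):

```r
# When the null is true (the mean really is 0), about 5% of t-tests
# reject at the .05 level
set.seed(123)
p_vals <- replicate(10000, t.test(rnorm(30))$p.value)
mean(p_vals < .05)   # close to 0.05
```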
In regression output, the p-value typically indicates the probability of seeing a t-statistic as extreme as the one in your sample if the null hypothesis is true:
```r
voteshares$income_over_65k <- voteshares$median_income > 65000

gof_stats <- c("nobs")

coef_names <- c("income_over_65kTRUE" = "Median income over >$65K")

model_1 <- lm(demshare ~ income_over_65k, data = voteshares)
modelsummary(list("DV: 2020 statewide Biden %" = model_1),
  statistic = c("p.value"),
  gof_map = gof_stats,
  coef_rename = coef_names,
  notes = "Note: p.values in parentheses"
)
```

|  | DV: 2020 statewide Biden % |
|---|---|
| (Intercept) | 0.445 |
|  | (<0.001) |
| Median income over >$65K | 0.126 |
|  | (<0.001) |
| Num.Obs. | 51 |

Note: p.values in parentheses
Importantly, this assumes that the sampling distribution of coefficients is approximately normal.
There are some methods for relaxing this assumption, however. One method, mostly used for experiments, is randomization inference, where a binary treatment is randomly “shuffled”:
```r
model_1 <- lm(demshare ~ income_over_65k, data = voteshares)
obs_effect <- model_1$coefficients[2]

set.seed(888)
perms <- 50000
effects <- numeric(length = perms)
for (i in 1:perms) {
  # shuffle the binary treatment, breaking any real relationship
  permuted <- sample(voteshares$income_over_65k)
  mod <- lm(demshare ~ permuted, data = voteshares)
  est_effect <- coef(mod)[2]
  effects[i] <- est_effect
}
# two-sided permutation p-value, as a percentage
mean((effects >= obs_effect) | (effects <= -obs_effect)) * 100
```

```
[1] 0.008
```
Another non-parametric method is the bootstrap, which relies on re-sampling observations with replacement and then re-running the regression model.
```r
# requires dplyr (slice_sample, bind_rows, group_by, summarize) and broom (tidy)
model_1 <- lm(demshare ~ income_over_65k, data = voteshares)

iterations <- 10000
results <- vector("list", length = iterations)
model_data <- model_1$model
for (i in 1:iterations) {
  # resample rows with replacement, then refit the model
  booted <- model_data |> slice_sample(n = nrow(model_data), replace = TRUE)
  mod <- lm(demshare ~ income_over_65k, data = booted)
  results[[i]] <- tidy(mod)
}
result_frame <- bind_rows(results, .id = "iteration")

result_frame |>
  group_by(term) |>
  summarize(
    lb = quantile(estimate, .025),
    est = median(estimate),
    ub = quantile(estimate, .975),
    p = min(mean(estimate > 0), mean(estimate < 0)) * 2
  )
```

| term | lb | est | ub | p |
|---|---|---|---|---|
| (Intercept) | 0.417 | 0.445 | 0.474 | 0 |
| income_over_65kTRUE | 0.0628 | 0.126 | 0.192 | 0 |
From: McShane, B. B., & Gal, D. (2017). Statistical significance and the dichotomization of evidence. Journal of the American Statistical Association, 112(519), 885-895.
Often misinterpreted, even by experts
They’re continuous, but they tend to be treated as binary and conflated with importance
In large samples, they’re hard to read because they get infinitesimally small
In repeated sampling, 95% of the 95% confidence intervals will contain the correct value of \(\beta\)
It’s misleading to say there’s a 95% chance that the population \(\beta\) falls between the CI boundaries. (Frequentist statistics assumes that \(\beta\) is fixed, so it doesn’t have a “chance” of doing anything.)
Better to say “I’m 95% certain this range contains the true value”
Pro: P-values and confidence intervals are based on the same basic assumptions, but CIs can be more informative about uncertainty and don’t lend themselves to the same dichotomous interpretation that a p-value does.
Pro: Automatically scaled to the quantity of interest, so harder to confuse with a measure of substantive importance
Pro: can be used for equivalence testing (we’ll come back to this)
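The repeated-sampling coverage claim can be checked by simulation; a minimal sketch with simulated data (the true slope and sample sizes are arbitrary):

```r
# Across repeated samples, about 95% of the 95% CIs contain the true slope
set.seed(42)
true_beta <- 2
covered <- replicate(2000, {
  x <- rnorm(50)
  y <- true_beta * x + rnorm(50)
  ci <- confint(lm(y ~ x))["x", ]
  ci[1] <= true_beta && true_beta <= ci[2]
})
mean(covered)   # close to 0.95
```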
One proposed alternative: S-values AKA Surprisal or Shannon Entropy
\[\text{S-value} = -\log_2(P)\]
Increasing values as evidence gets stronger, and the sizes are less extreme:
| P value | S Value |
|---|---|
| 0.9 | 0.15 |
| 0.06 | 4.05 |
| 0.05 | 4.32 |
| 0.01 | 6.64 |
| 0.001 | 10 |
| 0.000000001 | 30 |
| 1e-100 | 332 |
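The table above is just the transformation applied directly; a one-line check in R:

```r
# S-value: bits of information against the null
s_value <- function(p) -log2(p)
round(s_value(c(0.05, 0.01, 0.001)), 2)   # 4.32, 6.64, 9.97
```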
Pro: Still based on the same assumptions as a p-value; it’s just a transformation of the p-value that might be more intuitive for people.
Pro: easy calculation (for a computer, at least)
Con: Not widely adopted. All of this stuff relies on convention.
Bayesian methods potentially allow us to do the thing we wish we could do with p-values: quantify hypothesis probabilities.
One downside is that it sort of requires everyone to adopt a totally different philosophical approach to probabilities: Bayesian statistics assume that the model parameters also have a distribution of their own, and we use data to update our knowledge about the nature of those distributions.
Another downside is that Bayesian inference requires us to come up with “prior” distributions for model parameters, which invariably involves some subjectivity.
Bayes Factors measure the weight of evidence in favor of one hypothesis relative to another.
Wagenmakers (2007) suggests a frequentist approximation of a Bayes Factor using the Bayesian Information Criterion.
Run a model with the variable of interest (model 1)
Run a model without the variable of interest (model 0)
Calculate the BIC for both
Calculate \(\frac{\exp(-\text{BIC}_1/2)}{\exp(-\text{BIC}_0/2)}\)
```r
ff <- read_csv("https://vincentarelbundock.github.io/Rdatasets/csv/causaldata/gov_transfers.csv")
model_0 <- lm(Support ~ Income_Centered + Education + Age, data = ff)
model_1 <- lm(Support ~ Participation + Income_Centered + Education + Age, data = ff)
exp(-BIC(model_1) / 2) / exp(-BIC(model_0) / 2)
```

```
[1] 7.509412
```
Numbers greater than 1 indicate evidence in favor of model 1, numbers less than 1 indicate evidence against model 1.
R-squared can be used for model comparison, but it doesn’t tell you whether a model is a good predictor.
Predicting Democratic voteshare using the 2016 share
| state | demshare | demshare_2016 | GEOID | pop | median_age | white_pop | bachelors_degree | median_income | income_over_65k |
|---|---|---|---|---|---|---|---|---|---|
| Alabama | 0.371 | 0.356 | 01 | 4.89e+06 | 39.2 | 3.3e+06 | 5.47e+05 | 5.2e+04 | FALSE |
| Alaska | 0.447 | 0.416 | 02 | 7.37e+05 | 34.6 | 4.67e+05 | 9.02e+04 | 7.78e+04 | TRUE |
| Arizona | 0.502 | 0.481 | 04 | 7.17e+06 | 37.9 | 5.29e+06 | 9.11e+05 | 6.15e+04 | FALSE |
| Arkansas | 0.358 | 0.357 | 05 | 3.01e+06 | 38.3 | 2.27e+06 | 3.09e+05 | 4.95e+04 | FALSE |
| California | 0.649 | 0.661 | 06 | 3.93e+07 | 36.7 | 2.21e+07 | 5.76e+06 | 7.87e+04 | TRUE |
| Colorado | 0.569 | 0.527 | 08 | 5.68e+06 | 36.9 | 4.63e+06 | 1.02e+06 | 7.52e+04 | TRUE |
\[ F = \frac{(RSS_{restricted} - RSS_{full}) / (df_{restricted} - df_{full})}{RSS_{full} / df_{full}} \]
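A worked check of this formula against R’s `anova()`, using simulated data (the variable names here are hypothetical, not from the voteshares example):

```r
# Nested-model F test, computed by hand and compared with anova()
set.seed(1)
d <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
d$y <- 1 + 2 * d$x1 + rnorm(100)    # x2 is a garbage predictor
restricted <- lm(y ~ x1, data = d)
full <- lm(y ~ x1 + x2, data = d)
rss_r <- sum(resid(restricted)^2); df_r <- df.residual(restricted)
rss_f <- sum(resid(full)^2);       df_f <- df.residual(full)
F_stat <- ((rss_r - rss_f) / (df_r - df_f)) / (rss_f / df_f)
F_stat   # matches anova(restricted, full)$F[2]
```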
We can continue to include additional covariates in the model. But the “fair comparison” here is probably the same model minus one covariate (rather than comparing to the null model).
| Res.Df | RSS | Df | Sum of Sq | F | Pr(>F) |
|---|---|---|---|---|---|
| 50 | 0.758 | ||||
| 49 | 0.00896 | 1 | 0.749 | 4.08e+03 | 4.25e-48 |
| 48 | 0.00881 | 1 | 0.00015 | 0.816 | 0.371 |
\[\text{BIC} = n\ln(RSS/n) + k\ln(n)\]
\[\text{AIC} = n\ln(RSS/n) + 2k\]
Note that both the AIC and BIC include a penalty for k, which is the number of model parameters. So both will penalize models with garbage predictors.
Lower values of both AIC and BIC indicate better model fits.
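A sketch with simulated data (model and variable names are hypothetical): the by-hand formula differs from R’s `AIC()` only by an additive constant shared by any two models fit to the same data, so model comparisons agree.

```r
# By-hand AIC vs. R's AIC(): differences between models match
set.seed(2)
d <- data.frame(x = rnorm(100), junk = rnorm(100))
d$y <- d$x + rnorm(100)
m1 <- lm(y ~ x, data = d)
m2 <- lm(y ~ x + junk, data = d)   # adds a garbage predictor
aic_hand <- function(m) {
  n <- nobs(m)
  k <- length(coef(m)) + 1         # +1 for the error variance
  n * log(sum(resid(m)^2) / n) + 2 * k
}
aic_hand(m2) - aic_hand(m1)        # equals AIC(m2) - AIC(m1)
```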
Basic method: set aside some observations, fit the model on the training set, and then predict the test set
For small samples, K-fold or leave-one-out cross-validation (LOOCV) is preferable: leave out one fold (or observation), predict it, repeat for each fold, then average the metric.
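The caret output below automates this; here is a minimal by-hand LOOCV sketch on simulated data, just to illustrate the procedure:

```r
# Hold out one row, fit on the rest, predict the held-out row, repeat,
# then average the error metric
set.seed(3)
d <- data.frame(x = rnorm(50))
d$y <- 2 * d$x + rnorm(50)
preds <- sapply(seq_len(nrow(d)), function(i) {
  fit <- lm(y ~ x, data = d[-i, ])
  predict(fit, newdata = d[i, , drop = FALSE])
})
sqrt(mean((d$y - preds)^2))   # LOOCV RMSE
```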
```
Linear Regression 

51 samples
 1 predictor

No pre-processing
Resampling: Leave-One-Out Cross-Validation 
Summary of sample sizes: 50, 50, 50, 50, 50, 50, ... 
Resampling results:

  RMSE        Rsquared   MAE       
  0.01380351  0.9871893  0.01096933

Tuning parameter 'intercept' was held constant at a value of TRUE
```
…but is this the right metric?
Conventional hypothesis testing “stacks the deck” in favor of finding no relationship, and failure to reject the null could simply be a function of small sample sizes.
Still, we often want to dismiss a claim:
Can citizen pressure campaigns make states more effective?
Do motor voter laws improve turnout?
Are some people actually psychic?
“Two one-sided tests”
Step 1: Set a minimum effect size (\(\Delta\)) worth caring about
Step 2: Conduct a one-sided t-test for each of:
H01: effect \(\leq -\Delta\)
H02: effect \(\geq \Delta\)
If both of these tests are rejected, then we can conclude the true effect is smaller in magnitude than the minimum effect size.
Eskine (2013) found that people exposed to organic foods became morally judgmental.
Moery and Calin-Jageman (2016) attempt to replicate this finding
Moery and Calin-Jageman’s replication results look like this:
| Group | N | Mean | SD |
|---|---|---|---|
| Control | 95 | 5.25 | 0.95 |
| Treatment | 89 | 5.22 | 0.83 |
The estimated effect: 5.25 - 5.22 = 0.03
Minimal effect size \(\Delta = 0.43\)
What is the probability of seeing a mean difference greater than or equal to 0.03 if the true difference is 0.43?
What is the probability of seeing a mean difference less than or equal to 0.03 if the true difference is -0.43?
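These two questions map onto two one-sided `t.test()` calls. The replication’s raw data aren’t reproduced here, so this sketch simulates groups that only mimic the reported means and SDs:

```r
# TOST by hand: two one-sided t-tests against the equivalence bounds
set.seed(4)
delta <- 0.43                     # minimum effect size worth caring about
control   <- rnorm(95, mean = 5.25, sd = 0.95)
treatment <- rnorm(89, mean = 5.22, sd = 0.83)
# H01: difference <= -delta
p_lower <- t.test(treatment, control, mu = -delta, alternative = "greater")$p.value
# H02: difference >= delta
p_upper <- t.test(treatment, control, mu = delta, alternative = "less")$p.value
c(p_lower = p_lower, p_upper = p_upper)   # reject both -> effect within bounds
```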
There’s also an equivalent method that relies on 90% confidence intervals instead of t-tests:
“Hypothesis 1: Increasing the effective number of ethnic groups from the 10th percentile (1.06) to the 90th percentile (2.48) will not lead to a substantively meaningful change in the effective number of political parties when the district magnitude is one.”
“Hypothesis 2: Increasing the district magnitude from one to seven will not lead to a substantively meaningful change in the effective number of political parties when the effective number of ethnic groups is one.”
Substantively meaningful effect: changing the effective number of parties by 0.62 (this is the difference in the effective number of political parties between the U.S. and the U.K.)
For instance, this model includes a polynomial term for percent Bachelor’s degrees in a given state. What is the effect of a one unit increase in Bachelor’s degrees, and is this statistically significant?
```
Call:
lm(formula = demshare ~ poly(pct_bachelors, 2, raw = T), data = voteshares)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.188633 -0.071404  0.007378  0.046200  0.214193 

Coefficients:
                                  Estimate Std. Error t value Pr(>|t|)  
(Intercept)                       0.595663   0.391311   1.522   0.1345  
poly(pct_bachelors, 2, raw = T)1 -0.057409   0.058086  -0.988   0.3279  
poly(pct_bachelors, 2, raw = T)2  0.003591   0.002125   1.690   0.0975 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.08616 on 48 degrees of freedom
Multiple R-squared:  0.5299, Adjusted R-squared:  0.5103 
F-statistic: 27.05 on 2 and 48 DF,  p-value: 1.356e-08
```
There’s not a straightforward way to calculate an effect or a confidence interval here analytically, so we need to use other methods.
One option is bootstrapping, another is simulation.
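A sketch of the simulation approach with a hypothetical polynomial model (the variable names, data, and evaluation point are assumptions, not the original code): draw coefficient vectors from their estimated sampling distribution, compute the quantity of interest for each draw, and summarize the draws.

```r
# Simulation for a nonlinear effect: sample coefficients from
# N(beta-hat, vcov), then compute the effect of a one-unit increase
library(MASS)
set.seed(5)
d <- data.frame(x = runif(200, 10, 40))
d$y <- 0.6 - 0.05 * d$x + 0.003 * d$x^2 + rnorm(200, sd = 0.1)
m <- lm(y ~ poly(x, 2, raw = TRUE), data = d)
draws <- mvrnorm(5000, mu = coef(m), Sigma = vcov(m))
x0 <- 25   # evaluate a one-unit increase starting from x = 25 (arbitrary)
effect <- draws %*% c(1, x0 + 1, (x0 + 1)^2) - draws %*% c(1, x0, x0^2)
quantile(effect, c(.025, .5, .975))
```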
```
      2.5%        50%      97.5% 
0.03353758 0.04420986 0.05611655 
```
The marginaleffects package will do this automatically:
```
 Estimate Std. Error    z Pr(>|z|)    S  2.5 % 97.5 %
   0.0441    0.00606 7.28   <0.001 41.4 0.0322  0.056

Term: pct_bachelors
Type: response
Comparison: +1
```